
1.

RCPA: An Open-Source R Package for Data Processing, Differential Analysis, Consensus Pathway Analysis, and Visualization.

Nguyen, Hung; Nguyen, Ha; Maghsoudi, Zeynab; Tran, Bang; Draghici, Sorin; Nguyen, Tin.

Curr Protoc ; 4(5): e1036, 2024 May.

Article En | MEDLINE | ID: mdl-38713133

Identifying impacted pathways is important because it provides insights into the biology underlying conditions beyond the detection of differentially expressed genes. Because of the importance of such analysis, more than 100 pathway analysis methods have been developed thus far. Despite the availability of many methods, it is challenging for biomedical researchers to learn and properly perform pathway analysis. First, the sheer number of methods makes it challenging to learn and choose the correct method for a given experiment. Second, computational methods require users to be savvy with coding syntax, and comfortable with command-line environments, areas that are unfamiliar to most life scientists. Third, as learning tools and computational methods are typically implemented only for a few species (i.e., human and some model organisms), it is difficult to perform pathway analysis on other species that are not included in many of the current pathway analysis tools. Finally, existing pathway tools do not allow researchers to combine, compare, and contrast the results of different methods and experiments for both hypothesis testing and analysis purposes. To address these challenges, we developed an open-source R package for Consensus Pathway Analysis (RCPA) that allows researchers to conveniently: (1) download and process data from NCBI GEO; (2) perform differential analysis using established techniques developed for both microarray and sequencing data; (3) perform both gene set enrichment, as well as topology-based pathway analysis using different methods that seek to answer different research hypotheses; (4) combine methods and datasets to find consensus results; and (5) visualize analysis results and explore significantly impacted pathways across multiple analyses. 
This protocol provides many example code snippets with detailed explanations and supports the analysis of more than 1000 species, two pathway databases, three differential analysis techniques, eight pathway analysis tools, six meta-analysis methods, and two consensus analysis techniques. The package is freely available on the CRAN repository. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC. Basic Protocol 1: Processing Affymetrix microarrays Basic Protocol 2: Processing Agilent microarrays Support Protocol: Processing RNA sequencing (RNA-Seq) data Basic Protocol 3: Differential analysis of microarray data (Affymetrix and Agilent) Basic Protocol 4: Differential analysis of RNA-Seq data Basic Protocol 5: Gene set enrichment analysis Basic Protocol 6: Topology-based (TB) pathway analysis Basic Protocol 7: Data integration and visualization.

Computational Biology , Software , Humans , Computational Biology/methods , Gene Expression Profiling/methods

2.

SNP-SVant: A Computational Workflow to Predict and Annotate Genomic Variants in Organisms Lacking Benchmarked Variants.

Gunasekaran, Deepika; Ardell, David H; Nobile, Clarissa J.

Curr Protoc ; 4(5): e1046, 2024 May.

Article En | MEDLINE | ID: mdl-38717471

Whole-genome sequencing is widely used to investigate population genomic variation in organisms of interest. Assorted tools have been independently developed to call variants from short-read sequencing data aligned to a reference genome, including single nucleotide polymorphisms (SNPs) and structural variations (SVs). We developed SNP-SVant, an integrated, flexible, and computationally efficient bioinformatic workflow that predicts high-confidence SNPs and SVs in organisms without benchmarked variants, which are traditionally used for distinguishing sequencing errors from real variants. In the absence of these benchmarked datasets, we leverage multiple rounds of statistical recalibration to increase the precision of variant prediction. The SNP-SVant workflow is flexible, with user options to trade off accuracy for sensitivity. The workflow predicts SNPs and small insertions and deletions using the Genome Analysis ToolKit (GATK) and predicts SVs using the Genome Rearrangement IDentification Software Suite (GRIDSS), and it culminates in variant annotation using custom scripts. A key utility of SNP-SVant is its scalability. Variant calling is a computationally expensive procedure, and thus, SNP-SVant uses a workflow management system with intermediary checkpoint steps to ensure efficient use of resources by minimizing redundant computations and omitting steps where dependent files are available. SNP-SVant also provides metrics to assess the quality of called variants and converts between VCF and aligned FASTA format outputs to ensure compatibility with downstream tools to calculate selection statistics, which are commonplace in population genomics studies. By accounting for both small and large structural variants, users of this workflow can obtain a wide-ranging view of genomic alterations in an organism of interest. 
Overall, this workflow advances our capabilities in assessing the functional consequences of different types of genomic alterations, ultimately improving our ability to associate genotypes with phenotypes. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC. Basic Protocol: Predicting single nucleotide polymorphisms and structural variations Support Protocol 1: Downloading publicly available sequencing data Support Protocol 2: Visualizing variant loci using Integrated Genome Viewer Support Protocol 3: Converting between VCF and aligned FASTA formats.

Polymorphism, Single Nucleotide , Software , Workflow , Polymorphism, Single Nucleotide/genetics , Computational Biology/methods , Genomics/methods , Molecular Sequence Annotation/methods , Whole Genome Sequencing/methods

3.

Integrated approach to generate artificial samples with low tumor fraction for somatic variant calling benchmarking.

Sergi, Aldo; Beltrame, Luca; Marchini, Sergio; Masseroli, Marco.

BMC Bioinformatics ; 25(1): 180, 2024 May 08.

Article En | MEDLINE | ID: mdl-38720249

BACKGROUND: High-throughput sequencing (HTS) has become the gold standard approach for variant analysis in cancer research. However, somatic variants may occur at low fractions due to contamination from normal cells or tumor heterogeneity; this poses a significant challenge for standard HTS analysis pipelines. The problem is exacerbated in scenarios with minimal tumor DNA, such as circulating tumor DNA in plasma. Assessing sensitivity and detection of HTS approaches in such cases is paramount, but time-consuming and expensive: specialized experimental protocols and a sufficient quantity of samples are required for processing and analysis. To overcome these limitations, we propose a new computational approach specifically designed for the generation of artificial datasets suitable for this task, simulating ultra-deep targeted sequencing data with low-fraction variants and demonstrating their effectiveness in benchmarking low-fraction variant calling. RESULTS: Our approach enables the generation of artificial raw reads that mimic real data without relying on pre-existing data by using NEAT, a fine-grained read simulator that generates artificial datasets using models learned from multiple different datasets. Then, it incorporates low-fraction variants to simulate somatic mutations in samples with minimal tumor DNA content. To prove the suitability of the created artificial datasets for low-fraction variant calling benchmarking, we used them as ground truth to evaluate the performance of widely-used variant calling algorithms: they allowed us to define tuned parameter values of major variant callers, considerably improving their detection of very low-fraction variants. 
CONCLUSIONS: Our findings highlight both the pivotal role of our approach in creating adequate artificial datasets with low tumor fraction, facilitating rapid prototyping and benchmarking of algorithms for such dataset type, as well as the important need of advancing low-fraction variant calling techniques.

Benchmarking , High-Throughput Nucleotide Sequencing , Neoplasms , High-Throughput Nucleotide Sequencing/methods , Humans , Neoplasms/genetics , Mutation , Algorithms , DNA, Neoplasm/genetics , Sequence Analysis, DNA/methods , Computational Biology/methods

4.

Disregarding multimappers leads to biases in the functional assessment of NGS data.

Almeida da Paz, Michelle; Warger, Sarah; Taher, Leila.

BMC Genomics ; 25(1): 455, 2024 May 08.

Article En | MEDLINE | ID: mdl-38720252

BACKGROUND: Standard ChIP-seq and RNA-seq processing pipelines typically disregard sequencing reads whose origin is ambiguous ("multimappers"). This usual practice has potentially important consequences for the functional interpretation of the data: genomic elements belonging to clusters composed of highly similar members are left unexplored. RESULTS: In particular, disregarding multimappers leads to the underrepresentation in epigenetic studies of recently active transposable elements, such as AluYa5, L1HS and SVAs. Furthermore, this common strategy also has implications for transcriptomic analysis: members of repetitive gene families, such as the ones including major histocompatibility complex (MHC) class I and II genes, are under-quantified. CONCLUSION: Revealing inherent biases that permeate routine tasks such as functional enrichment analysis, our results underscore the urgency of broadly adopting multimapper-aware bioinformatic pipelines (currently restricted to specific contexts or communities) to ensure the reliability of genomic and transcriptomic studies.

High-Throughput Nucleotide Sequencing , Humans , DNA Transposable Elements/genetics , Computational Biology/methods , Gene Expression Profiling/methods , Genomics/methods , Sequence Analysis, RNA/methods

5.

A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies.

Van, Richard; Alvarez, Daniel; Mize, Travis; Gannavarapu, Sravani; Chintham Reddy, Lohitha; Nasoz, Fatma; Han, Mira V.

BMC Bioinformatics ; 25(1): 181, 2024 May 08.

Article En | MEDLINE | ID: mdl-38720247

BACKGROUND: RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated from different labs. Often, the quality of the data is variable, procured differently, and contains unwanted noise hampering the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in attempts to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin. RESULTS: We aimed to investigate the impact of data preprocessing steps (focusing on normalization, batch effect correction, and data scaling) through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer. CONCLUSION: By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance measured by weighted F1-score in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. 
Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.

Machine Learning , Neoplasms , RNA-Seq , Humans , RNA-Seq/methods , Neoplasms/genetics , Transcriptome/genetics , Sequence Analysis, RNA/methods , Gene Expression Profiling/methods , Computational Biology/methods
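
The weighted F1-score named in entry 5 can be sketched in a few lines of Python. This is a minimal illustration of the standard definition (per-class F1 averaged with weights proportional to each class's number of true samples), not the authors' pipeline; the function name `weighted_f1` and the toy tissue labels are assumptions made for this example.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its share of true labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Hypothetical tissue-of-origin labels, for illustration only.
y_true = ["lung", "lung", "lung", "breast", "breast", "colon"]
y_pred = ["lung", "lung", "breast", "breast", "breast", "colon"]
print(round(weighted_f1(y_true, y_pred), 3))  # prints 0.833
```

Weighting by class support means frequent tumor types dominate the score, which is worth keeping in mind when the test set has a different class balance from the training set.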

6.

ARGNet: using deep neural networks for robust identification and classification of antibiotic resistance genes from sequences.

Pei, Yao; Shum, Marcus Ho-Hin; Liao, Yunshi; Leung, Vivian W; Gong, Yu-Nong; Smith, David K; Yin, Xiaole; Guan, Yi; Luo, Ruibang; Zhang, Tong; Lam, Tommy Tsan-Yuk.

Microbiome ; 12(1): 84, 2024 May 09.

Article En | MEDLINE | ID: mdl-38725076

BACKGROUND: Emergence of antibiotic resistance in bacteria is an important threat to global health. Antibiotic resistance genes (ARGs) are some of the key components to define bacterial resistance and their spread in different environments. Identification of ARGs, particularly from high-throughput sequencing data of the specimens, is the state-of-the-art method for comprehensively monitoring their spread and evolution. Current computational methods to identify ARGs mainly rely on alignment-based sequence similarities with known ARGs. Such approaches are limited by choice of reference databases and may potentially miss novel ARGs. The similarity thresholds are usually simple and could not accommodate variations across different gene families and regions. It is also difficult to scale up when sequence data are increasing. RESULTS: In this study, we developed ARGNet, a deep neural network that incorporates an unsupervised learning autoencoder model to identify ARGs and a multiclass classification convolutional neural network to classify ARGs that do not depend on sequence alignment. This approach enables a more efficient discovery of both known and novel ARGs. ARGNet accepts both amino acid and nucleotide sequences of variable lengths, from partial (30-50 aa; 100-150 nt) sequences to full-length protein or genes, allowing its application in both target sequencing and metagenomic sequencing. Our performance evaluation showed that ARGNet outperformed other deep learning models including DeepARG and HMD-ARG in most of the application scenarios especially quasi-negative test and the analysis of prediction consistency with phylogenetic tree. ARGNet has a reduced inference runtime by up to 57% relative to DeepARG. CONCLUSIONS: ARGNet is flexible, efficient, and accurate at predicting a broad range of ARGs from the sequencing data. ARGNet is freely available at https://github.com/id-bioinfo/ARGNet , with an online service provided at https://ARGNet.hku.hk . Video Abstract.

Bacteria , Neural Networks, Computer , Bacteria/genetics , Bacteria/drug effects , Bacteria/classification , Drug Resistance, Bacterial/genetics , Anti-Bacterial Agents/pharmacology , High-Throughput Nucleotide Sequencing/methods , Computational Biology/methods , Genes, Bacterial/genetics , Drug Resistance, Microbial/genetics , Humans , Deep Learning

7.

Unraveling the immunogenetic landscape of autism spectrum disorder: a comprehensive bioinformatics approach.

Ma, Jieying; Liu, Deyang; Zhao, Jianzhong; Fang, Xiaolu; Bu, Dengyin.

Front Immunol ; 15: 1347139, 2024.

Article En | MEDLINE | ID: mdl-38726016

Background: Autism spectrum disorder (ASD) is a disease characterized by social disorder. Recently, the population affected by ASD has gradually increased around the world. There are great difficulties in diagnosis and treatment at present. Methods: The ASD datasets were obtained from the Gene Expression Omnibus database and the immune-relevant genes were downloaded from a previously published compilation. Subsequently, we used WGCNA to screen the modules related to ASD and immunity. We also chose the best combination and screened out the core genes from Consensus Machine Learning Driven Signatures (CMLS). We then evaluated the genetic correlation between immune cells and ASD using GNOVA, and identified pleiotropic regions between ASD and immune cells using PLACO and CPASSOC. FUMA was used to identify pleiotropic regions, and expression quantitative trait loci (eQTL) analysis was used to determine their expression in different tissues and cells. Finally, we used qPCR to detect the gene expression level of the core gene. Results: We found a close relationship between neutrophils and ASD, and subsequently, CMLS identified a total of 47 potential candidate genes. Secondly, GNOVA showed a significant genetic correlation between neutrophils and ASD, and PLACO and CPASSOC identified a total of 14 pleiotropic regions. We annotated the 14 regions mentioned above and identified a total of 6 potential candidate genes. Through eQTL analysis, we found that the CFLAR gene has a specific expression pattern in neutrophils, suggesting that it may serve as a potential biomarker for ASD and is closely related to its pathogenesis. Conclusions: In conclusion, our study yields unprecedented insights into the molecular and genetic heterogeneity of ASD through a comprehensive bioinformatics analysis. These valuable findings hold significant implications for tailoring personalized ASD therapies.

Autism Spectrum Disorder , Computational Biology , Genetic Predisposition to Disease , Quantitative Trait Loci , Humans , Autism Spectrum Disorder/genetics , Autism Spectrum Disorder/immunology , Computational Biology/methods , Gene Expression Profiling , Gene Regulatory Networks , Machine Learning , Databases, Genetic , Immunogenetics , Neutrophils/immunology , Neutrophils/metabolism , Transcriptome

8.

FAIR-USE4OS: Guidelines for creating impactful open-source software.

Sonabend, Raphael; Gruson, Hugo; Wolansky, Leo; Kiragga, Agnes; Katz, Daniel S.

PLoS Comput Biol ; 20(5): e1012045, 2024 May.

Article En | MEDLINE | ID: mdl-38722873

This paper extends the FAIR (Findable, Accessible, Interoperable, Reusable) guidelines to provide criteria for assessing if software conforms to best practices in open source. By adding "USE" (User-Centered, Sustainable, Equitable), software development can adhere to open source best practice by incorporating user-input early on, ensuring front-end designs are accessible to all possible stakeholders, and planning long-term sustainability alongside software design. The FAIR-USE4OS guidelines will allow funders and researchers to more effectively evaluate and plan open-source software projects. There is good evidence of funders increasingly mandating that all funded research software is open source; however, even under the FAIR guidelines, this could simply mean software released on public repositories with a Zenodo DOI. By creating FAIR-USE software, best practice can be demonstrated from the very beginning of the design process and the software has the greatest chance of success by being impactful.

Guidelines as Topic , Software , Computational Biology/methods , Software Design , Humans

9.

AMPActiPred: A three-stage framework for predicting antibacterial peptides and activity levels with deep forest.

Yao, Lantian; Guan, Jiahui; Xie, Peilin; Chung, Chia-Ru; Deng, Junyang; Huang, Yixian; Chiang, Ying-Chih; Lee, Tzong-Yi.

Protein Sci ; 33(6): e5006, 2024 Jun.

Article En | MEDLINE | ID: mdl-38723168

The emergence and spread of antibiotic-resistant bacteria pose a significant public health threat, necessitating the exploration of alternative antibacterial strategies. Antibacterial peptides (ABPs) are a kind of antimicrobial peptide (AMP) with the potential to fight bacterial infection, offering a promising avenue for developing novel therapeutic interventions. This study introduces AMPActiPred, a three-stage computational framework designed to identify ABPs, characterize their activity against diverse bacterial species, and predict their activity levels. AMPActiPred employs multiple peptide descriptors to capture the compositional features and physicochemical properties of peptides. AMPActiPred uses a deep forest architecture, a cascading architecture similar to deep neural networks, capable of effectively processing and exploring original features to enhance predictive performance. In the first stage, AMPActiPred focuses on ABP identification, achieving an Accuracy of 87.6% and an MCC of 0.742 on an elaborate dataset, demonstrating state-of-the-art performance. In the second stage, AMPActiPred achieved an average GMean of 82.8% in identifying ABPs targeting 10 bacterial species, indicating AMPActiPred can achieve balanced predictions regarding the functional activity of ABPs across this set of species. In the third stage, AMPActiPred demonstrates robust predictive capabilities for ABP activity levels with an average PCC of 0.722. Furthermore, AMPActiPred exhibits excellent interpretability, elucidating crucial features associated with antibacterial activity. AMPActiPred is the first computational framework capable of predicting targets and activity levels of ABPs. Finally, to facilitate the utilization of AMPActiPred, we have established a user-friendly web interface deployed at https://awi.cuhk.edu.cn/~AMPActiPred/.

Anti-Bacterial Agents , Anti-Bacterial Agents/pharmacology , Anti-Bacterial Agents/chemistry , Antimicrobial Peptides/chemistry , Antimicrobial Peptides/pharmacology , Bacteria/drug effects , Computational Biology/methods , Neural Networks, Computer , Microbial Sensitivity Tests
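
Entry 9 reports three metrics by abbreviation (MCC, GMean, PCC). The sketch below shows how they are commonly computed. It assumes the usual readings of the abbreviations (Matthews correlation coefficient, geometric mean of sensitivity and specificity, and Pearson correlation coefficient); the abstract does not spell them out, and the function names and toy inputs are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def gmean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (assumed meaning of GMean)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy labels: 1 = antibacterial peptide, 0 = not (illustrative only).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
preds = [1, 1, 1, 0, 0, 0, 0, 1]
print(mcc(labels, preds))    # prints 0.5
print(gmean(labels, preds))  # prints 0.75
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # prints 1.0
```

MCC and GMean both penalize a classifier that favors one class, which is why they are often reported alongside accuracy on imbalanced data.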

10.

CASi: A framework for cross-timepoint analysis of single-cell RNA sequencing data.

Wang, Yizhuo; Flowers, Christopher R; Wang, Michael; Huang, Xuelin; Li, Ziyi.

Sci Rep ; 14(1): 10633, 2024 May 09.

Article En | MEDLINE | ID: mdl-38724550

Single-cell RNA sequencing (scRNA-seq) technology has been widely used to study the differences in gene expression at the single cell level, providing insights into the research of cell development, differentiation, and functional heterogeneity. Various pipelines and workflows of scRNA-seq analysis have been developed but few considered multi-timepoint data specifically. In this study, we develop CASi, a comprehensive framework for analyzing multiple timepoints' scRNA-seq data, which provides users with: (1) cross-timepoint cell annotation, (2) detection of potentially novel cell types emerged over time, (3) visualization of cell population evolution, and (4) identification of temporal differentially expressed genes (tDEGs). Through comprehensive simulation studies and applications to a real multi-timepoint single cell dataset, we demonstrate the robust and favorable performance of the proposal versus existing methods serving similar purposes.

Sequence Analysis, RNA , Single-Cell Analysis , Single-Cell Analysis/methods , Sequence Analysis, RNA/methods , Humans , Gene Expression Profiling/methods , Software , Computational Biology/methods

11.

Identification of candidate biomarkers for GBM based on WGCNA.

Sun, Qinghui; Wang, Zheng; Xiu, Hao; He, Na; Liu, Mingyu; Yin, Li.

Sci Rep ; 14(1): 10692, 2024 May 10.

Article En | MEDLINE | ID: mdl-38724609

Glioblastoma multiforme (GBM), the most aggressive form of primary brain tumor, poses a considerable challenge in neuro-oncology. Despite advancements in therapeutic approaches, the prognosis for GBM patients remains bleak, primarily attributed to its inherent resistance to conventional treatments and a high recurrence rate. The primary goal of this study was to acquire molecular insights into GBM by constructing a gene co-expression network, aiming to identify and predict key genes and signaling pathways associated with this challenging condition. To investigate differentially expressed genes between various grades of GBM, we employed the Weighted Gene Co-expression Network Analysis (WGCNA) methodology. Through this approach, we were able to identify modules with specific expression patterns in GBM. Next, genes from these modules were subjected to Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis using the clusterProfiler package. Our findings revealed a negative correlation between biological processes associated with neuronal development and functioning and GBM. Conversely, the processes related to the cell cycle, glomerular development, and ECM-receptor interaction exhibited a positive correlation with GBM. Subsequently, hub genes, including SYP, TYROBP, and ANXA5, were identified. This study offers a comprehensive overview of the existing research landscape on GBM, underscoring the challenges encountered by clinicians and researchers in devising effective therapeutic strategies.

Biomarkers, Tumor , Brain Neoplasms , Gene Expression Profiling , Gene Expression Regulation, Neoplastic , Gene Regulatory Networks , Glioblastoma , Humans , Glioblastoma/genetics , Glioblastoma/pathology , Glioblastoma/metabolism , Biomarkers, Tumor/genetics , Biomarkers, Tumor/metabolism , Brain Neoplasms/genetics , Brain Neoplasms/pathology , Brain Neoplasms/metabolism , Gene Ontology , Computational Biology/methods

12.

GNNGL-PPI: multi-category prediction of protein-protein interactions using graph neural networks based on global graphs and local subgraphs.

Zeng, Xin; Meng, Fan-Fang; Wen, Meng-Liang; Li, Shu-Juan; Li, Yi.

BMC Genomics ; 25(1): 406, 2024 May 09.

Article En | MEDLINE | ID: mdl-38724906

Most proteins exert their functions by interacting with other proteins, making the identification of protein-protein interactions (PPI) crucial for understanding biological activities, pathological mechanisms, and clinical therapies. Developing effective and reliable computational methods for predicting PPI can significantly reduce the time and labor associated with traditional biological experiments. However, accurately identifying the specific categories of protein-protein interactions and improving the prediction accuracy of the computational methods remain dual challenges. To tackle these challenges, we proposed a novel graph neural network method called GNNGL-PPI for multi-category prediction of PPI based on global graphs and local subgraphs. GNNGL-PPI consisted of two main components: using Graph Isomorphism Network (GIN) to extract global graph features from the PPI network graph, and employing GIN As Kernel (GIN-AK) to extract local subgraph features from the subgraphs of protein vertices. Additionally, considering the imbalanced distribution of samples in each category within the benchmark datasets, we introduced an Asymmetric Loss (ASL) function to further enhance the predictive performance of the method. Through evaluations on six benchmark test sets formed by three different dataset partitioning algorithms (Random, BFS, DFS), GNNGL-PPI outperformed the state-of-the-art multi-category prediction methods of PPI, as measured by the comprehensive performance evaluation metric F1-measure. Furthermore, interpretability analysis confirmed the effectiveness of GNNGL-PPI as a reliable multi-category prediction method for predicting protein-protein interactions.

Algorithms , Computational Biology , Neural Networks, Computer , Protein Interaction Mapping , Protein Interaction Mapping/methods , Computational Biology/methods , Protein Interaction Maps , Humans , Proteins/metabolism
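
Entry 12 mentions an Asymmetric Loss (ASL) function for imbalanced categories without giving its form. The sketch below assumes the formulation commonly used for multi-label classification: positives use a focal-style term with exponent `gamma_pos`, and negatives use a probability shifted down by `margin` with a stronger exponent `gamma_neg`, so easy negatives contribute little. The exact variant and hyperparameters used by GNNGL-PPI are not stated in the abstract, so the function name, parameter names, and default values here are illustrative assumptions.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Sum of per-label asymmetric losses.

    probs   : predicted probabilities in [0, 1], one per label
    targets : 0/1 ground-truth indicators, one per label
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive label: focal-style down-weighting of confident positives.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative label: shift the probability down by the margin, so
            # predictions below the margin contribute zero loss.
            p_m = max(p - margin, 0.0)
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total


# One positive label and three negatives (hypothetical interaction categories).
probs = [0.8, 0.04, 0.30, 0.90]
targets = [1, 0, 0, 0]
print(asymmetric_loss(probs, targets))
```

With `gamma_pos=0`, `gamma_neg=0`, and `margin=0`, the function reduces to ordinary binary cross-entropy, which makes it easy to check an implementation against a known baseline.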

13.

OMD Curation Toolkit: a workflow for in-house curation of public omics datasets.

Piquer-Esteban, Samuel; Arnau, Vicente; Diaz, Wladimiro; Moya, Andrés.

BMC Bioinformatics ; 25(1): 184, 2024 May 09.

Article En | MEDLINE | ID: mdl-38724907

BACKGROUND: Major advances in sequencing technologies and the sharing of data and metadata in science have resulted in a wealth of publicly available datasets. However, working with and especially curating public omics datasets remains challenging despite these efforts. While a growing number of initiatives aim to re-use previous results, these present limitations that often lead to the need for further in-house curation and processing. RESULTS: Here, we present the Omics Dataset Curation Toolkit (OMD Curation Toolkit), a python3 package designed to accompany and guide the researcher during the curation process of metadata and fastq files of public omics datasets. This workflow provides a standardized framework with multiple capabilities (collection, control check, treatment and integration) to facilitate the arduous task of curating public sequencing data projects. While centered on the European Nucleotide Archive (ENA), the majority of the provided tools are generic and can be used to curate datasets from different sources. CONCLUSIONS: Thus, it offers valuable tools for the in-house curation previously needed to re-use public omics data. Due to its workflow structure and capabilities, it can be easily used and benefit investigators in developing novel omics meta-analyses based on sequencing data.

Data Curation , Software , Workflow , Data Curation/methods , Metadata , Databases, Genetic , Genomics/methods , Computational Biology/methods

14.

DCGAN-DTA: Predicting drug-target binding affinity with deep convolutional generative adversarial networks.

Kalemati, Mahmood; Zamani Emani, Mojtaba; Koohi, Somayyeh.

BMC Genomics ; 25(1): 411, 2024 May 09.

Article En | MEDLINE | ID: mdl-38724911

BACKGROUND: In recent years, there has been a growing interest in utilizing computational approaches to predict drug-target binding affinity, aiming to expedite the early drug discovery process. To address the limitations of experimental methods, such as cost and time, several machine learning-based techniques have been developed. However, these methods encounter certain challenges, including the limited availability of training data, reliance on human intervention for feature selection and engineering, and a lack of validation approaches for robust evaluation in real-life applications. RESULTS: To mitigate these limitations, in this study, we propose a method for drug-target binding affinity prediction based on deep convolutional generative adversarial networks. Additionally, we conducted a series of validation experiments and implemented adversarial control experiments using straw models. These experiments serve to demonstrate the robustness and efficacy of our predictive models. We conducted a comprehensive evaluation of our method by comparing it to baselines and state-of-the-art methods. Two recently updated datasets, namely BindingDB and PDBBind, were used for this purpose. Our findings indicate that our method outperforms the alternative methods in terms of three performance measures when using warm-start data splitting settings. Moreover, when considering physicochemical-based cold-start data splitting settings, our method demonstrates superior predictive performance, particularly in terms of the concordance index. CONCLUSION: The results of our study affirm the practical value of our method and its superiority over alternative approaches in predicting drug-target binding affinity across multiple validation sets. This highlights the potential of our approach in accelerating drug repurposing efforts, facilitating novel drug discovery, and ultimately enhancing disease treatment. 
The data and source code for this study were deposited in the GitHub repository, https://github.com/mojtabaze7/DCGAN-DTA . Furthermore, the web server for our method is accessible at https://dcgan.shinyapps.io/bindingaffinity/ .

Drug Discovery , Drug Discovery/methods , Computational Biology/methods , Humans , Neural Networks, Computer , Protein Binding , Machine Learning
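
The concordance index highlighted in entry 14 measures how often a model ranks pairs of drug-target affinities in the correct order. The following is a minimal sketch of the usual pairwise definition (pairs with equal true affinity are skipped, and tied predictions count as half), not the authors' evaluation code; the function name and the toy affinity values are assumptions.

```python
def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # pairs with equal true affinity are not comparable
            comparable += 1
            d_true = y_true[i] - y_true[j]
            d_pred = y_pred[i] - y_pred[j]
            if d_pred == 0:
                concordant += 0.5  # tied predictions get half credit
            elif (d_true > 0) == (d_pred > 0):
                concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs: all true values are equal")
    return concordant / comparable


# Hypothetical measured vs. predicted binding affinities.
measured = [5.0, 6.2, 7.1, 8.3]
predicted = [5.5, 6.0, 7.9, 7.4]
print(round(concordance_index(measured, predicted), 3))  # prints 0.833
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The quadratic loop is fine for small test sets but would need a faster algorithm for large ones.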

15.

Prediction of anticancer drug sensitivity using an interpretable model guided by deep learning.

Pang, Weixiong; Chen, Ming; Qin, Yufang.

BMC Bioinformatics ; 25(1): 182, 2024 May 09.

Article En | MEDLINE | ID: mdl-38724920

BACKGROUND: The prediction of drug sensitivity plays a crucial role in improving the therapeutic effect of drugs. However, testing the effectiveness of drugs is challenging due to the complex mechanism of drug reactions and the lack of interpretability in most machine learning and deep learning methods. Therefore, it is imperative to establish an interpretable model that receives various cell line and drug feature data to learn drug response mechanisms and achieve stable predictions between available datasets. RESULTS: This study proposes a new and interpretable deep learning model, DrugGene, which integrates gene expression, gene mutation, gene copy number variation of cancer cells, and chemical characteristics of anticancer drugs to predict their sensitivity. This model comprises two different branches of neural networks, where the first involves a hierarchical structure of biological subsystems that uses the biological processes of human cells to form a visual neural network (VNN) and an interpretable deep neural network for human cancer cells. DrugGene receives genotype input from the cell line and detects changes in the subsystem states. We also employ a traditional artificial neural network (ANN) to capture the chemical structural features of drugs. DrugGene generates final drug response predictions by combining VNN and ANN and integrating their outputs into a fully connected layer. The experimental results using drug sensitivity data extracted from the Cancer Drug Sensitivity Genome Database and the Cancer Treatment Response Portal v2 reveal that the proposed model is better than existing prediction methods. Therefore, our model achieves higher accuracy, learns the reaction mechanisms between anticancer drugs and cell lines from various features, and interprets the model's predicted results. 
CONCLUSIONS: Our method utilizes biological pathways to construct neural networks, which can use genotypes to monitor changes in the state of network subsystems, thereby interpreting the prediction results in the model and achieving satisfactory prediction accuracy. This will help explore new directions in cancer treatment. The code is freely available on GitHub ( https://github.com/pangweixiong/DrugGene ).

Antineoplastic Agents , Deep Learning , Neural Networks, Computer , Humans , Antineoplastic Agents/pharmacology , Neoplasms/drug therapy , Neoplasms/genetics , Cell Line, Tumor , DNA Copy Number Variations , Computational Biology/methods

16.

Data-driven selection of analysis decisions in single-cell RNA-seq trajectory inference.

Dong, Xiaoru; Leary, Jack R; Yang, Chuanhao; Brusko, Maigan A; Brusko, Todd M; Bacher, Rhonda.

Brief Bioinform ; 25(3)2024 Mar 27.

Article En | MEDLINE | ID: mdl-38725155

Single-cell RNA sequencing (scRNA-seq) experiments have become instrumental in developmental and differentiation studies, enabling the profiling of cells at a single or multiple time-points to uncover subtle variations in expression profiles reflecting underlying biological processes. Benchmarking studies have compared many of the computational methods used to reconstruct cellular dynamics; however, researchers still encounter challenges in their analysis due to uncertainty with respect to selecting the most appropriate methods and parameters. Even among universal data processing steps used by trajectory inference methods such as feature selection and dimension reduction, trajectory methods' performances are highly dataset-specific. To address these challenges, we developed Escort, a novel framework for evaluating a dataset's suitability for trajectory inference and quantifying trajectory properties influenced by analysis decisions. Escort evaluates the suitability of trajectory analysis and the combined effects of processing choices using trajectory-specific metrics. Escort navigates single-cell trajectory analysis through these data-driven assessments, reducing uncertainty and much of the decision burden inherent to trajectory inference analyses. Escort is implemented in an accessible R package and R/Shiny application, providing researchers with the necessary tools to make informed decisions during trajectory analysis and enabling new insights into dynamic biological processes at single-cell resolution.

RNA-Seq , Single-Cell Analysis , Single-Cell Analysis/methods , RNA-Seq/methods , Humans , Computational Biology/methods , Sequence Analysis, RNA/methods , Software , Algorithms , Gene Expression Profiling/methods , Single-Cell Gene Expression Analysis

17.

Contrastive learning for enhancing feature extraction in anticancer peptides.

Lee, Byungjo; Shin, Dongkwan.

Brief Bioinform ; 25(3)2024 Mar 27.

Article En | MEDLINE | ID: mdl-38725157

Cancer, recognized as a primary cause of death worldwide, has profound health implications and incurs a substantial social burden. Numerous efforts have been made to develop cancer treatments, among which anticancer peptides (ACPs) are garnering recognition for their potential applications. While ACP screening is time-consuming and costly, in silico prediction tools provide a way to overcome these challenges. Herein, we present a deep learning model designed to screen ACPs using peptide sequences only. A contrastive learning technique was applied to enhance model performance, yielding better results than a model trained solely on binary classification loss. Furthermore, two independent encoders were employed as a replacement for data augmentation, a technique commonly used in contrastive learning. Our model achieved superior performance on five of six benchmark datasets against previous state-of-the-art models. As prediction tools advance, the potential in peptide-based cancer therapeutics increases, promising a brighter future for oncology research and patient care.
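To make the contrastive idea concrete, the sketch below computes an InfoNCE-style loss over embeddings produced by two encoders, where the i-th output of each encoder forms the positive pair. The abstract does not state which contrastive loss the authors used, so InfoNCE, the cosine similarity, the temperature value, and all names here are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(views_a, views_b, temperature=0.1):
    """Mean InfoNCE loss; views_a[i] and views_b[i] are the positive pair."""
    losses = []
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Hypothetical embeddings of two peptides from encoder A and encoder B.
encoder_a = [[1.0, 0.0], [0.0, 1.0]]
encoder_b = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(encoder_a, encoder_b)
shuffled_loss = info_nce(encoder_a, list(reversed(encoder_b)))
```

The loss is small when the two encoders agree on each peptide and large when matching embeddings are misaligned, which is the signal a contrastive objective adds on top of a classification loss.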

Antineoplastic Agents , Deep Learning , Peptides , Peptides/chemistry , Peptides/therapeutic use , Humans , Antineoplastic Agents/therapeutic use , Antineoplastic Agents/chemistry , Neoplasms/drug therapy , Computational Biology/methods , Machine Learning , Algorithms

18.

TransPTM: a transformer-based model for non-histone acetylation site prediction.

Meng, Lingkuan; Chen, Xingjian; Cheng, Ke; Chen, Nanjun; Zheng, Zetian; Wang, Fuzhou; Sun, Hongyan; Wong, Ka-Chun.

Brief Bioinform ; 25(3)2024 Mar 27.

Article En | MEDLINE | ID: mdl-38725156

Protein acetylation is one of the most extensively studied post-translational modifications (PTMs) due to its significant roles across a myriad of biological processes. Although many computational tools for acetylation site identification have been developed, there is a lack of benchmark datasets and bespoke predictors for non-histone acetylation site prediction. To address these problems, we have contributed to both dataset creation and predictor benchmarking in this study. First, we construct a non-histone acetylation site benchmark dataset, namely NHAC, which includes 11 subsets according to the sequence length ranging from 11 to 61 amino acids. There are 886 positive samples and 4707 negative samples in total for each sequence length. Second, we propose TransPTM, a transformer-based neural network model for non-histone acetylation site prediction. During the data representation phase, per-residue contextualized embeddings are extracted using ProtT5 (an existing pre-trained protein language model). This is followed by the implementation of a graph neural network framework, which consists of three TransformerConv layers for feature extraction and a multilayer perceptron module for classification. The benchmark results show that TransPTM has competitive performance for non-histone acetylation site prediction against three state-of-the-art tools. It improves our comprehension of the PTM mechanism and provides a theoretical basis for developing drug targets for diseases. Moreover, the created PTM datasets fill the gap in non-histone acetylation site datasets and are beneficial to the related communities. The related source code and data utilized by TransPTM are accessible at https://www.github.com/TransPTM/TransPTM.
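The fixed-length subsets described above suggest sequence windows cut around candidate sites. The following is a minimal sketch of that step, not the NHAC construction code: it assumes each sample is centred on a lysine (K), that window lengths are odd, and that positions beyond the protein ends are padded with a placeholder residue "X". The function names and the example sequence are hypothetical, and the abstract does not list the 11 individual lengths.

```python
def extract_window(sequence, site, length, pad="X"):
    """Return the `length`-residue window centred on the 0-based index `site`."""
    if length % 2 == 0:
        raise ValueError("length must be odd so the site sits at the centre")
    if not 0 <= site < len(sequence):
        raise IndexError("site is outside the sequence")
    half = length // 2
    left = sequence[max(0, site - half):site]
    right = sequence[site + 1:site + 1 + half]
    # Pad whichever flank was cut short by a sequence end.
    left = pad * (half - len(left)) + left
    right = right + pad * (half - len(right))
    return left + sequence[site] + right


def lysine_windows(sequence, length):
    """List (position, window) pairs for every lysine in the sequence."""
    return [
        (i, extract_window(sequence, i, length))
        for i, residue in enumerate(sequence)
        if residue == "K"
    ]


# Hypothetical toy sequence; 11 is the shortest length named in the abstract.
windows = lysine_windows("MSTAKRLQEGKVDA", 11)
```

Each window would then be passed to the embedding step, with the central residue marking the candidate site.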

Neural Networks, Computer , Protein Processing, Post-Translational , Acetylation , Computational Biology/methods , Databases, Protein , Software , Algorithms , Humans , Proteins/chemistry , Proteins/metabolism

19.

Identification of Breast Cancer Subtypes Based on Endoplasmic Reticulum Stress-Related Genes and Analysis of Prognosis and Immune Microenvironment in Breast Cancer Patients.

Yi, Chen; Yang, Jun; Zhang, Ting; Qin, Liu; Chen, Dongjuan.

Technol Cancer Res Treat ; 23: 15330338241241484, 2024.

Article En | MEDLINE | ID: mdl-38725284

Introduction: Endoplasmic reticulum stress (ERS) is a response to the accumulation of unfolded proteins and plays a crucial role in the development of tumors, including processes such as tumor cell invasion, metastasis, and immune evasion. However, the specific regulatory mechanisms of ERS in breast cancer (BC) remain unclear. Methods: In this study, we analyzed RNA sequencing data from The Cancer Genome Atlas (TCGA) for breast cancer and identified 8 core genes associated with ERS: ELOVL2, IFNG, MAP2K6, MZB1, PCSK6, PCSK9, IGF2BP1, and POP1. We evaluated their individual expression, independent diagnostic, and prognostic values in breast cancer patients. A multifactorial Cox analysis established a risk prognostic model, validated with an external dataset. Additionally, we conducted a comprehensive assessment of immune infiltration and drug sensitivity for these genes. Results: The results indicate that these eight core genes play a crucial role in regulating the immune microenvironment of breast cancer (BRCA) patients. Meanwhile, an independent diagnostic model based on the expression of these eight genes shows limited independent diagnostic value, and its independent prognostic value is unsatisfactory, with the time ROC AUC values generally below 0.5. According to the results of logistic regression neural networks and risk prognosis models, when these eight genes interact synergistically, they can serve as excellent biomarkers for the diagnosis and prognosis of breast cancer patients. Furthermore, the research findings have been confirmed through qPCR experiments and validation. Conclusion: In conclusion, we explored the mechanisms of ERS in BRCA patients and identified 8 outstanding biomolecular diagnostic markers and prognostic indicators. The research results were double-validated using the GEO database and qPCR.

Biomarkers, Tumor , Breast Neoplasms , Endoplasmic Reticulum Stress , Gene Expression Regulation, Neoplastic , Tumor Microenvironment , Humans , Female , Tumor Microenvironment/immunology , Tumor Microenvironment/genetics , Breast Neoplasms/genetics , Breast Neoplasms/immunology , Breast Neoplasms/pathology , Prognosis , Endoplasmic Reticulum Stress/genetics , Biomarkers, Tumor/genetics , Gene Expression Profiling , Computational Biology/methods , Databases, Genetic , ROC Curve , Kaplan-Meier Estimate , Transcriptome

20.

Multi-Input data ASsembly for joint Analysis (MIASA): A framework for the joint analysis of disjoint sets of variables.

Raharinirina, Nomenjanahary Alexia; Sunkara, Vikram; von Kleist, Max; Fackeldey, Konstantin; Weber, Marcus.

PLoS One ; 19(5): e0302425, 2024.

Article En | MEDLINE | ID: mdl-38728301

The joint analysis of two datasets [Formula: see text] and [Formula: see text] that describe the same phenomena (e.g. the cellular state), but measure disjoint sets of variables (e.g. mRNA vs. protein levels) is currently challenging. Traditional methods typically analyze single interaction patterns such as variance or covariance. However, problem-tailored external knowledge may contain multiple different kinds of information about the interaction between the measured variables. We introduce MIASA, a holistic framework for the joint analysis of multiple different variables. It consists of assembling multiple different kinds of information such as similarity vs. association, expressed in terms of interaction-scores or distances, for subsequent clustering/classification. In addition, our framework includes a novel qualitative Euclidean embedding method (qEE-Transition) which enables using Euclidean-distance/vector-based clustering/classification methods on datasets that have a non-Euclidean-based interaction structure. As an alternative to conventional optimization-based multidimensional scaling methods which are prone to uncertainties, our qEE-Transition generates a new vector representation for each element of the dataset union [Formula: see text] in a common Euclidean space while strictly preserving the original ordering of the assembled interaction-distances. To demonstrate our work, we applied the framework to three types of simulated datasets: samples from families of distributions, samples from correlated random variables, and time-courses of statistical moments for three different types of stochastic two-gene interaction models. We then compared different clustering methods with vs. without the qEE-Transition. For all examples, we found that the qEE-Transition followed by Ward clustering had superior performance compared to non-agglomerative clustering methods but had a varied performance against ultrametric-based agglomerative methods. 
We also tested the qEE-Transition followed by supervised and unsupervised machine learning methods and found promising results; however, more work is needed for optimal parametrization of these methods. As a future perspective, our framework points to the importance of further development and validation of distance-distribution models aiming to capture multiple complex interactions between different variables.

Algorithms , Cluster Analysis , Humans , Computational Biology/methods